Abstract


The Discrete Cosine Transform (DCT) stands apart from other
orthogonal transforms because its performance compares closely
with that of the Karhunen-Loeve Transform (KLT). The KLT is
considered optimal, but there is no fast algorithm to compute it,
which makes the DCT an attractive alternative. This book presents
two 8 x 8 DCT routines and is divided into the following parts:


- The DCT algorithm
- Implementation on the TMS320C25 and TMS320C30 processors
- TMS320C25 code for a roundoff routine
- Signal flow graphs for 2-point, 4-point, and 8-point DCTs
- TMS320C30 code for bit reversal
- Execution times and memory requirements


The appendices at the end of the book contain code for the DCT 
algorithms for both the TMS320C25 and TMS320C30 processors. 
An 8 x 8 Discrete Cosine Transform Implementation on the TMS320C25 or the TMS320C30


Introduction


In the general class of orthogonal transforms, one in particular, the
discrete cosine transform (DCT), has recently gained wide popularity in signal
processing. The DCT has found applications in such areas as data compression, pattern
recognition, and Wiener filtering, primarily because its performance compares closely
with that of the Karhunen-Loeve Transform (KLT) with respect to rate-distortion
criteria [1]. Although the KLT is considered to be optimal, there is no fast algorithm
to compute it, which makes the DCT an attractive alternative.


For image coding, the DCT works well because of the high correlation among adjacent
data samples (pixel values). Because of this correlation, the DCT provides near-optimal
data reduction while retaining high image quality. In a comparative study [2], the DCT
was shown to outperform the Fourier, Hartley, and cas-cas transforms for image
compression, providing even more motivation for finding fast implementations.


A number of algorithms have been developed, most notably those of Hou [3] and
Lee [4], which generate higher-order DCTs from lower-order ones. This paper presents
two 8 x 8 DCT routines, one for the TMS320C25 and another for the TMS320C30, based
upon the routine in [3].


The DCT Algorithm 


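For a given real data sequence $x_0, x_1, \ldots, x_{N-1}$, the discrete cosine transform is
given in [1] as

\[
z_k = \sqrt{\frac{2}{N}}\,\alpha(k) \sum_{n=0}^{N-1} x_n \cos\left[\frac{(2n+1)k\pi}{2N}\right], \qquad k = 0, 1, \ldots, N-1 \tag{1a}
\]

and its inverse is

\[
x_n = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} \alpha(k)\, z_k \cos\left[\frac{(2n+1)k\pi}{2N}\right], \qquad n = 0, 1, \ldots, N-1 \tag{1b}
\]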
where $\alpha(k) = 1/\sqrt{2}$ for $k = 0$ and unity otherwise; with this choice, the transform
is unitary. If $z_0$ is scaled up by $\sqrt{2}$, the DCT can also be written in matrix form as

\[
z = \sqrt{\frac{2}{N}}\, T(N)\, x \tag{2}
\]

where $x$ and $z$ are column vectors denoting the input and output data sequences, and $T(N)$
is the DCT matrix of order $N$. Expanding the matrix (neglecting the factor of
$\sqrt{2/N}$ for the moment), a 4-point DCT appears as
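\[
\begin{bmatrix} z_0 \\ z_2 \\ z_1 \\ z_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
\alpha & -\alpha & \alpha & -\alpha \\
\beta & -\delta & -\beta & \delta \\
\delta & \beta & -\delta & -\beta
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_2 \\ x_3 \\ x_1 \end{bmatrix} \tag{3}
\]

where $\alpha = 1/\sqrt{2}$, $\beta = \cos(\pi/8)$, and $\delta = \sin(\pi/8)$. Similarly, the
8-point DCT can be expressed as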


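\[
\begin{bmatrix} z_0 \\ z_4 \\ z_2 \\ z_6 \\ z_1 \\ z_5 \\ z_3 \\ z_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
\alpha & -\alpha & \alpha & -\alpha & \alpha & -\alpha & \alpha & -\alpha \\
\beta & -\delta & -\beta & \delta & \beta & -\delta & -\beta & \delta \\
\delta & \beta & -\delta & -\beta & \delta & \beta & -\delta & -\beta \\
\lambda & \mu & -\nu & -\gamma & -\lambda & -\mu & \nu & \gamma \\
\mu & \nu & -\gamma & \lambda & -\mu & -\nu & \gamma & -\lambda \\
\gamma & -\lambda & \mu & \nu & -\gamma & \lambda & -\mu & -\nu \\
\nu & \gamma & \lambda & \mu & -\nu & -\gamma & -\lambda & -\mu
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_2 \\ x_4 \\ x_6 \\ x_7 \\ x_5 \\ x_3 \\ x_1 \end{bmatrix} \tag{4}
\]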


where $\lambda = \cos(\pi/16)$, $\gamma = \cos(3\pi/16)$, $\mu = \sin(3\pi/16)$, and
$\nu = \sin(\pi/16)$. Note that the input is no longer in natural order but has been
rearranged according to the permutation matrix $P$ and the relation

\[
\hat{x} = P x, \tag{5}
\]


where 
Figure 1. Signal Flow Graphs for 2-Point, 4-Point, and 8-Point DCTs
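The definitions in (1a), (1b), (2), and (4) can be checked numerically. The following is a minimal Python sketch, not code from the appendices; the function names and the two index lists are illustrative assumptions, with the input and output orderings read off the vectors on either side of (4).

```python
import math

# Assumed orderings, read from the column vectors in (4).
IN_ORDER = [0, 2, 4, 6, 7, 5, 3, 1]   # permuted input order
OUT_ORDER = [0, 4, 2, 6, 1, 5, 3, 7]  # bit-reversed output order


def alpha(k):
    # alpha(k) = 1/sqrt(2) for k = 0, unity otherwise.
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct(x):
    # Forward transform (1a), evaluated directly in O(N^2) operations.
    N = len(x)
    return [math.sqrt(2.0 / N) * alpha(k) *
            sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]


def idct(z):
    # Inverse transform (1b).
    N = len(z)
    return [math.sqrt(2.0 / N) *
            sum(alpha(k) * z[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for k in range(N))
            for n in range(N)]


def dct8_matrix_form(x):
    # Matrix form (2) for N = 8, using the row and column ordering of (4).
    # In this form z_0 comes out scaled up by sqrt(2).
    T = [[math.cos((2 * n + 1) * k * math.pi / 16) for n in IN_ORDER]
         for k in OUT_ORDER]
    x_hat = [x[n] for n in IN_ORDER]
    scale = math.sqrt(2.0 / 8)
    z = [0.0] * 8
    for row, k in zip(T, OUT_ORDER):
        z[k] = scale * sum(t * v for t, v in zip(row, x_hat))
    return z
```

Applying `idct(dct(x))` should return `x` to within floating-point error, and `dct8_matrix_form(x)` should agree with `dct(x)` except for the first element, which is larger by a factor of the square root of 2.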


The structure of the algorithm looks very much like that of a Fast Fourier Transform 
(FFT), since the most fundamental computation is a 2-point butterfly. This routine is actually 
a generalized case of the Cooley-Tukey FFT algorithm with the addition of the recursion 
at the end. If the equations for the signal flow graph are written explicitly, the recursive 
nature of the DCT becomes clear; for a 4-point DCT, we have 


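\[
\begin{aligned}
z_0 &= \hat{z}_0, \\
z_2 &= \hat{z}_2, \\
z_1 &= \hat{z}_1, \\
z_3 &= 2\hat{z}_3 - z_1,
\end{aligned}
\]

and for the 8-point DCT,

\[
\begin{aligned}
z_0 &= \hat{z}_0, \\
z_4 &= \hat{z}_4, \\
z_2 &= \hat{z}_2, \\
z_6 &= \hat{z}_6, \\
z_1 &= \hat{z}_1, \\
z_3 &= 2\hat{z}_3 - z_1, \\
z_5 &= 2\hat{z}_5 - z_3, \\
z_7 &= 2\hat{z}_7 - z_5.
\end{aligned}
\]

Here $\hat{z}_k$ denotes the value at output $k$ of the flow graph before the final recursion
stage.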
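The recursion for the odd-indexed outputs can be checked numerically. The sketch below is illustrative Python, not code from the appendices. It assumes that the intermediate value for odd index 2j+1 is the sum of x_n cos((2n+1)pi/2N) cos((2n+1)j pi/N); this definition is an assumption chosen so that the product-to-sum identity 2 cos A cos B = cos(A+B) + cos(A-B) yields the recursion, and it is not taken from the flow graph itself. Scale factors are omitted, as in the equations above.

```python
import math


def dct_unscaled(x, k):
    # z_k = sum_n x_n cos((2n+1) k pi / 2N), i.e. (1a) without scale factors.
    N = len(x)
    return sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
               for n in range(N))


def z_hat_odd(x, j):
    # ASSUMED intermediate value for odd index 2j+1 (hypothetical definition):
    # sum_n x_n cos((2n+1) pi / 2N) cos((2n+1) j pi / N).
    N = len(x)
    return sum(x[n]
               * math.cos((2 * n + 1) * math.pi / (2 * N))
               * math.cos((2 * n + 1) * j * math.pi / N)
               for n in range(N))


def odd_outputs_by_recursion(x):
    # z_1 = zhat_1, then z_{2j+1} = 2 * zhat_{2j+1} - z_{2j-1}.
    N = len(x)
    z = {1: z_hat_odd(x, 0)}
    for j in range(1, N // 2):
        z[2 * j + 1] = 2.0 * z_hat_odd(x, j) - z[2 * j - 1]
    return z
```

For any 4-point or 8-point input, each value returned by `odd_outputs_by_recursion` should match `dct_unscaled` at the same index to within floating-point error.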


To create a unitary transform, each element in the vector should be multiplied by
the scaling factor $\sqrt{2/N}$ for both the forward and inverse transforms. The inverse
transform is obtained by completely reversing the direction of the signal flow graph; i.e.,
performing the bit-reversal first, then the recursions and the butterflies, and finally, the
data permutation.


For the two-dimensional case of interest, the DCT can be described in the form

\[
z(k,l) = \frac{2}{N}\,\alpha(k)\,\alpha(l) \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} x(m,n) \cos\left[\frac{(2m+1)k\pi}{2N}\right] \cos\left[\frac{(2n+1)l\pi}{2N}\right] \tag{8a}
\]

and its inverse as

\[
x(m,n) = \frac{2}{N} \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} \alpha(k)\,\alpha(l)\, z(k,l) \cos\left[\frac{(2m+1)k\pi}{2N}\right] \cos\left[\frac{(2n+1)l\pi}{2N}\right] \tag{8b}
\]

where $\alpha(k) = 1/\sqrt{2}$ for $k = 0$, unity otherwise. Like the FFT, the DCT kernel is
separable, allowing the transform to be performed in two steps, first along the rows and
then the columns.
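Separability can be illustrated by comparing the double sum in (8a) with a row pass followed by a column pass. This is a minimal Python sketch with illustrative function names, not the routine in the appendices.

```python
import math


def alpha(k):
    # alpha(k) = 1/sqrt(2) for k = 0, unity otherwise.
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct_1d(x):
    # One-dimensional transform (1a).
    N = len(x)
    return [math.sqrt(2.0 / N) * alpha(k) *
            sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]


def dct_2d_direct(block):
    # Two-dimensional transform evaluated directly from (8a).
    N = len(block)
    return [[(2.0 / N) * alpha(k) * alpha(l) *
             sum(block[m][n]
                 * math.cos((2 * m + 1) * k * math.pi / (2 * N))
                 * math.cos((2 * n + 1) * l * math.pi / (2 * N))
                 for m in range(N) for n in range(N))
             for l in range(N)]
            for k in range(N)]


def dct_2d_separable(block):
    # Step 1: transform each row. rows[m][l] is indexed by (m, l).
    N = len(block)
    rows = [dct_1d(r) for r in block]
    # Step 2: transform each column of the row-transformed data.
    cols = [dct_1d([rows[m][l] for m in range(N)]) for l in range(N)]
    # cols[l][k] holds z(k, l); reindex so the result is z[k][l].
    return [[cols[l][k] for l in range(N)] for k in range(N)]
```

The two functions should agree to within floating-point error. The separable version needs 2N one-dimensional transforms of length N, whereas the direct sum costs on the order of N^4 multiplications.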


Implementation on the TMS320C25


The DCT algorithm may be carried out in one of two ways, either using


1. A matrix formulation, where the DCT coefficients are simply multiplied by 
the data, or 


2. The signal flow graph. 


This routine uses a matrix formulation, which requires the sixty-four cosine 
coefficients to be stored in an array in memory. The matrix formulation is based on the 
following equation: 
where $\lambda = \cos(\pi/16)$, $\gamma = \cos(3\pi/16)$, $\mu = \sin(3\pi/16)$, and $\nu = \sin(\pi/16)$.


The algorithm described above has been shown to be numerically stable for fixed-point
processors; however, to prevent serious data errors, truncation and roundoff must
be accounted for. A roundoff technique similar to the one in [6] is used, in which the
matrix coefficients are prescaled by $(2^{15} - 1)$. The product is then loaded into the
accumulator with a one-bit left shift, effectively dividing it by $2^{15}$. After a
multiplication is performed, the 32-bit value in the accumulator must be rounded to
sixteen bits, where bits 13, 14, and 15 are used to determine the value of the sixteenth
bit. The TMS320C25 performs this operation in a single instruction by adding 3000h to the
accumulator product with a one-bit left shift, as outlined in the code shown in Figure 2.
        INITIALIZE MATRIX COEFFICIENTS AND ROUNDOFF VALUES INTO
        INTERNAL BLOCK 0


Figure 2. TMS320C25 Code for Roundoff Routine
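The effect of rounding, as opposed to truncating, a fixed-point product can be sketched in Python. This is not the instruction sequence of Figure 2: it assumes a conventional round-to-nearest step (adding half of the discarded range, 2^14, before dropping the low 15 bits) rather than the 3000h constant described above, and the function names are illustrative.

```python
def to_q15(c):
    # Prescale a coefficient in [-1, 1] by (2^15 - 1), as described in the text.
    return int(round(c * (2 ** 15 - 1)))


def q15_mul_truncate(coeff_q15, sample):
    # Multiply, then simply discard the low 15 bits (truncation).
    return (coeff_q15 * sample) >> 15


def q15_mul_round(coeff_q15, sample):
    # ASSUMED rounding rule: add 2^14 (half an output LSB) before discarding
    # the low 15 bits. This stands in for the 3000h technique of Figure 2.
    return (coeff_q15 * sample + (1 << 14)) >> 15


one = to_q15(1.0)                    # 32767
print(q15_mul_truncate(one, 1000))   # prints 999
print(q15_mul_round(one, 1000))      # prints 1000
```

With a coefficient of 1.0 and a sample of 1000, truncation loses one count while rounding recovers the expected value. Over the sixty-four multiply-accumulate operations of each pass, such one-count errors would otherwise bias the result.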


After the multiplications are computed, the results are stored in another array area 
in transposed order; thus, a separate routine for transposing the matrix is not needed. Once 
the rows are transformed, the pointers for the input and output matrices are exchanged. 
When the procedure is repeated, the output is stored as rows, completing the transform. 
Appendix A contains a complete program listing for the forward transform on the 
TMS320C25. To perform an inverse DCT, the table of cosine coefficients should be 
replaced with those used for an inverse transform. 
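The transposed store and pointer exchange described above can be sketched as follows. This is illustrative Python rather than the TMS320C25 listing in Appendix A; the names `dct_matrix`, `row_pass_transposed`, and `dct_2d_pingpong` are hypothetical, and floating-point arithmetic is used in place of the fixed-point arithmetic of the actual routine.

```python
import math


def dct_matrix(N=8):
    # Coefficient table: C[k][n] = sqrt(2/N) * alpha(k) * cos((2n+1) k pi / 2N).
    def alpha(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    return [[math.sqrt(2.0 / N) * alpha(k)
             * math.cos((2 * n + 1) * k * math.pi / (2 * N))
             for n in range(N)]
            for k in range(N)]


def row_pass_transposed(C, src, dst):
    # Transform each row of src and store it as a COLUMN of dst,
    # so no separate transpose routine is needed.
    N = len(C)
    for m in range(N):
        for k in range(N):
            dst[k][m] = sum(C[k][n] * src[m][n] for n in range(N))


def dct_2d_pingpong(block):
    N = len(block)
    C = dct_matrix(N)
    a = [row[:] for row in block]          # input array
    b = [[0.0] * N for _ in range(N)]      # output array
    row_pass_transposed(C, a, b)           # first pass: rows in, columns out
    a, b = b, a                            # exchange input/output pointers
    row_pass_transposed(C, a, b)           # second pass completes the 2-D DCT
    return b
```

After the second pass the result is stored in natural orientation, with element `[k][l]` holding z(k,l) of (8a). For an inverse transform, the same structure applies with the table replaced by its transpose.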


Implementation on the TMS320C30 


The TMS320C30’s increased speed and flexible addressing modes can reduce 
execution time substantially. In using the FFT-like structure, extraneous multiplications 
are removed, and because of the TMS320C30’s ability to perform parallel 
multiplication/additions, two butterflies can be computed at once. After an initial subtraction 
is done, the coefficient multiplication can be executed in parallel with the addition of the 
data. The TMS320C30’s floating-point capability eliminates not only the problems of
roundoff error associated with fixed-point processors but also the need for any truncation
routines.


Because the DCT size is fixed to eight points, there are only four locations that need 
exchanging; this allows for a fast bit-reversal of the data. When using the TMS320C30’s 
extended-precision registers for temporary storage, the transfers can be done in-place. 
These data transfers are also done in parallel, since two load or store operations can be 
performed simultaneously. The code for performing the bit reversal is shown in Figure 
3 below. 


*
*       CORRECT ORDER FROM BIT REVERSED TO NATURAL
*
BITREV  LDF     *AR0,R0         ; ONLY FOUR LOCATIONS ARE
||      LDF     *-AR2,R1        ; ACTUALLY SWITCHED
        STF     R1,*AR0
||      STF     R0,*-AR2
        LDF     *AR1,R0
||      LDF     *-AR3,R1
        STF     R1,*AR1
||      STF     R0,*-AR3


Figure 3. TMS320C30 Code for Bit Reversal 
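The claim that only four locations need exchanging for eight points can be checked with a short Python sketch. The helper names are illustrative, and the sketch does not assume which addresses the pointers in Figure 3 hold; it only shows which index pairs differ under 3-bit reversal.

```python
def bit_reverse(i, bits):
    # Reverse the lowest `bits` bits of index i.
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def swap_pairs(bits):
    # Index pairs (i, j) with j = bit_reverse(i) and i < j. Indices that are
    # their own bit reversal stay in place and are not listed.
    return [(i, bit_reverse(i, bits))
            for i in range(1 << bits)
            if i < bit_reverse(i, bits)]


def bit_reverse_in_place(v):
    # Reorder an 8-element list between bit-reversed and natural order.
    for i, j in swap_pairs(3):
        v[i], v[j] = v[j], v[i]
    return v


print(swap_pairs(3))   # prints [(1, 4), (3, 6)]
```

Indices 0, 2, 5, and 7 map to themselves, so two swaps touching four locations are enough, which is consistent with the two load pairs and two store pairs in Figure 3.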


Because of the amount of data shuffling that occurs, an eight-word scratch-pad vector 
has been created with four permanent pointers set up at every other memory location. 
This allows access to each element in the vector (by predecrement or preincrement 
addressing) without requiring constant alteration of one or two pointer locations. Although 
there is no overhead for looping on the TMS320C30, straight-line coding is used as much 
as possible to increase performance. 


You can transpose the DCT matrix in the same way as in the TMS320C25 
implementation: namely, store the transformed row vector as a column vector in another 
matrix and interchange the input and output pointers. 


The complete routines for the forward and inverse transforms are given in
Appendix B.


Results 


The execution times and memory requirements for the two routines are given in 
Table 1. For the TMS320C30 implementation, the forward transform contains the scale 
factor of 2, so the transform is not unitary. When the signal flow is reversed,
instructions accumulate and the time required to perform the inverse transform actually
increases (see Table 1). This increase occurs because certain multiplications cannot be 
performed in parallel with another instruction. The two times are identical on a TMS320C25 
because it uses a matrix routine to compute the transform. 


Table 1. Execution Times and Memory Requirements†


                                  Memory Required
Device                         Program         Data           Time Required (µs)

TMS320C25-50 (matrix)          232 words*      203 words*     205.8 (forward)
                               232 words       203 words      205.8 (inverse)
TMS320C30-40 (signal-flow)     125 words**     138 words**    33.6 (forward)
                               112 words       137 words      31.9 (inverse)
TMS320C30-40 (matrix)          115 words**     128 words**    65.8 (forward)
                               115 words       128 words      65.8 (inverse)


†  Improvements have been made and are shown in this table. You may obtain the latest code from
   the BBS, (713) 274-2323.

*  TMS320C25 word lengths are 16 bits.

** TMS320C30 word lengths are 32 bits.


Summary


Two routines for a two-dimensional Discrete Cosine Transform are presented: one
for the TMS320C25 and one for the TMS320C30, with a development of the algorithm
given for clarification. This report also discusses the similarities of the DCT to the
Cooley-Tukey FFT algorithm and arithmetic shortcuts that can reduce the DCT’s execution
time. Although these implementations use the most recent formulation, there is still room
for investigation into more efficient methods. Another approach that might prove fruitful
is to deal with the entire 8 x 8 array all at once, as suggested by Haque [7], rather than
transforming the array by rows and columns. However, both routines given in the
appendices provide fast, numerically stable solutions for applications requiring the DCT.